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Abstract 

ChlP-seq, which combines chromatin immunoprecipitation with massively parallel 
short-read sequencing, can profile in vivo genome-wide transcription factor-DNA asso- 
ciation with higher sensitivity, specificity and spatial resolution than ChlP-chip. While 
it presents new opportunities for research, ChlP-seq poses new challenges for statistical 
analysis that derive from the complexity of the biological systems characterized and 
the variability and biases in its digital sequence data. We propose a method called 
PICS (Probabilistic Inference for ChlP-seq) for extracting information from ChlP-seq 
aligned-read data in order to identify regions bound by transcription factors. PICS 
identifies enriched regions by modeling local concentrations of directional reads, and 
uses DNA fragment length prior information to discriminate closely adjacent bind- 
ing events via a Bayesian hierarchical t-mixture model. Its per-event fragment length 
estimates also allow it to remove from analysis regions that have atypical lengths. 
PICS uses pre-calculated, whole-genome read mappability profiles and a truncated t- 
distribution to adjust binding event models for reads that are missing due to local 
genome repetitiveness. It estimates uncertainties in model parameters that can be 
used to define confidence regions on binding event locations and to filter estimates. 
Finally, PICS calculates a per-event enrichment score relative to a control sample, and 
can use a control sample to estimate a false discovery rate. We compared PICS to 
the alternative methods MACS, QuEST, and CisGenome, using published GABP and 
FOXAl data sets from human cell lines, and found that PICS' predicted binding sites 
were more consistent with computationally predicted binding motifs. 

KEY WORDS: Bayesian hierarchical model; ChlP-seq; EM algorithm; Mappability; 
Missing values; Mixture model; Transcription factor; Truncated data; t-distribution. 



*To whom correspondance should be addressed raphael.gottardo@ircm.qc.ca 



1 



1 Introduction 

ChlP-seq combines chromatin immumoprecipitation with massively parallel short-read se- 
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quencing overcomes certain limitations of profiling with microarrays (ChlP-chip), it raises 
statistical and computational challenges, some of which are related to those for ChlP-chip, 
and others that are novel. A typical ChlP-seq data set consist of millions or tens of millions of 
sequence reads that are generated from ends of DNA fragments. Read lengths are currently 
typically in the range of 36-50 bp, and the quality of called bases varies along and between 
reads; as the sequencing technology evolves, read lengths and quality, and the number of 
sequence reads generated in a machine run are increasing. While pairs of end reads can be 
generated from each DNA fragment, current ChlP-seq data typically consist of single-end 
reads, in which each immunoprecipitated DNA fragment contributes a directional read from 
only one randomly selected fragment end. 



After read sequences have been aligned to a reference genome ( iBarski and Zhad . l2009l ) , 
the goal of subsequent analysis is to transform the aligned read data into a form that re- 
fiects the local density of immunoprecipitated DNA fragments, and, in the work described 
here, to estimate locations where transcription factors were associated with DNA in the 
experimental cellular system. Analysis is complicated by biases in local read densities that 
can be introduced by sequencing and aligning, and by chromatin structure and genome copy 
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and reads that cannot be uniquely aligned are rejected. In typical mammalian ChlP-seq ex- 
periments, 30 to 40 percent of reads may be discarded, but higher rates can be encountered 
in particular experiments. Because of ChlP-seq's cost-effectiveness, such global losses are 
usually not an important practical consideration; however, analysis methods typically make 
no corrections for the local biases in aligned read densities that are caused by repetitive 



regions. 
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Certain types of biases in read density profiles can be estimated by sequencing a 'control' 
sample in addition to the immunoprecipitated 'treatment' sample, and th en using an analysis 



method that considers the treatment profile re lative to the control profile (IKharchenko et al. 



2008 



Rozowsky et al.l . 12009 : 



Nix et al 



20081 ). Considering control data can help identify 



enriched regions that are false positives, assess numerical background models, and estimate 
a threshold for segmenting a read density or 'enrichment' profile in order to identify a subset 
of significantly enriched regions. Analysis methods can be described as 'two-sample' when a 
control data set is available and 'single sample' when only treatment data are available. 

In summary, once reads have been aligned to a reference genome, there are at least four 
central analysis issues: interpreting the information in local densities of directional reads; 
identifying which high local read densities represent false positives; addressing biases in read 
densities that arise from local variations in the efficiency with which reads can be aligned to 
unique genomic locations; and segmenting the enrichment profile to identify a statistically 
and biologically meaningful subset of enriched regions. 

ChlP-seq uses relatively new sequencing t echnology, and, as was the case while C h lP-ch ip 



developed as an experimental approach (e.g. 



Johnson et al 
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Valouev et al. 
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statistical analysis methods are actively being developed. 
QuEST, a method based on kernel density estimates of the forward and reverse read counts, 
which allows estimating the length of DNA fragments. The separate forward/reverse pro- 
files are then combined to provide an estimate of binding site locations and to quantify the 
enrichment. When control sample data a re available, QuES T can also estimate a false dis- 
covery rate (FDR). Like QuEST, MACS (jZhang et al.l . l2008l ) uses both forward and reverse 
read profiles to empirically model the 'shift size' of ChlP-seq reads, and uses it to improve 
the spatial accuracy of the predicted binding sites. Instead of using kernel density estimates, 
MACS uses a parametri c model based on a dynamic Poisson distribution to identify and 
quantify binding events. lJi et al.l (l2008l ) introduced a 'CisGenome' analysis pipeline for the 
analysis of ChlP-chip and ChlP-seq data. Their method is also based on a Poisson back- 
ground model, but includes functionality not offered by MACS and QuEST, e.g. filtering 
atypical regions, and different types of FDR estimates. 

While these methods have established statistical approaches for ChlP-seq analysis, model- 
based and Bayesian approaches are in earlier stages of development. In the work described 
here, we introduce a method for probabilistic inference of ChlP-seq data (PICS) that is 
based on a Bayesian hierarchical truncated t-mixture model. PICS integrates four important 
components. First, it jointly models local concentrations of directional reads. Second, it uses 
mixture models to distinguish closely-spaced adjacent binding events. Third, it incorporates 
prior information for the length distribution of immunoprecipitated DNA to help resolve 
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closely adjacent binding events, and identifies enriched regions that have atypical fragment 
lengths. Fourth, it uses pre-calculated whole-genome read 'mappability' profiles to adjust 
local read densities for reads that are missing due to genome repetitiveness. For each binding 
event, PICS returns an enrichment score that is relative to a control sample when such a 
sample is available, and it can use a control sample to estimate a false discovery rate. Finally, 
because it is based on a probabilistic model, PICS can compute measures of uncertainty for 
binding site estimates, and these can be used to estimate binding site locations and to filter 
low-confidence regions. 

The paper is organized as follows. In section 2, we introduce the data structure and some 
notation. In section 3, we present our Bayesian hierarchical truncated t- mixture model and 
show how we use it to detect binding events, and to estimate binding site positions and their 
confidence intervals. In section 4, we apply PICS to two published, experimental, human 
ChlP-seq datasets, and compare its results to results from three other methods: QuEST, 
MACS and CisGenome. In section 5, we briefly discuss our results and possible extensions. 



2 Data, Preprocessing, and Notations 



We us ed two ChlP-seq data sets that have been analyzed by other groups. IZhang et al. 



fl2008h . using 'MACS', identified bi nding sites of 



OXA l (hepatocyte nuclear factor 3a) in 



human MCF7 (breast cancer) cells. IValouev et al.l (120081 ). using 'QuEST', identified binding 
sites of the growth associated binding protein (GABP) in human Jurkat T cells. Each data 
set consists of single-end reads for a treatment (ChIP) and a control sample. The FOXAl 
data consist of 3, 909, 507 treatment reads and 5, 233, 322 input DNA control reads, while 
the GABP data consist of 7, 830, 602 treatment reads and 17, 028, 066 control reads. 

Because most of the genome should not interact specifically with a given transcription 
factor, ChlP-seq aligned-read data are usually sparse, consisting largely of regions in which 
few or no reads are observed. Given this, we first pre-process the read data by segmenting 
the genome into candidate regions, each of which has a minimum number of reads that 
aligned to forward and reverse strands. We detect such regions using a w = 100 bp sliding 
window with an s = 10 bp step size, counting the number of forward strand reads in the left 
half and the number of reverse strand reads in the right half. Beginning at the start of each 
chromosome, we retain windows that contain at least one forward read and one reverse read. 
For each chromosome, after merging overlapping windows and removing merged regions with 
less than two forward or reverse reads, we obtain a disjoint set of candidate regions, each 
of which we analyze separately. For the work described here, because DNA fragments are 
often between 100 and 300 bp long after gel size selection, we chose w = 100 bp, and we set 
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s = 10 bp for computational convenience. We tested other values for w and s and obtained 
essentially the same candidate regions. 



3 Model, priors and parameter estimation 

In this section, we use IQa{a,f3) to denote an inverse gamma distribution, and Qa{a,f3) 
to denote a gamma distribution with shape parameter a and an inverse scale parameter 
(3. Similarly, N(yU, cr^) denotes a Normal distribution with mean /i and variance a^, while 
t4(/U, 0"^) denotes a t distribution with 4 degrees of freedom, mean /i and variance parameter 



3.1 Modeling a single binding event 

Having segmented the read data into candidate regions, as described in section [21 we now 
assume that each region contains a single transcription factor binding site. An extension to 
the case of multiple binding sites is treated below. Let us denote by fi and rj the i — th 
and j — th forward and reverse reads in a given region, with i = 1, . . . , n/ and j = 1, . . . , n^. 
Note that the number of forward reads, Uf, and reverse reads, Ur, will typically vary between 
candidate regions. We jointly model the forward and reverse reads as: 



where /i represents the binding site position, 6 is the distance between the maxima of the 
forward and reverse distributions, which corresponds to the average DNA fragment size of 
the binding event in question, and cx/ and ar measure the corresponding variability in DNA 
fragment lengths. Note that this approach differs from that typical for sequencing data, in 
that we do not model the sequence counts, but rather the distributions of the fragment ends, 
for which we have more prior information. Figure [TK displays a candidate region with one 
binding event, along with the corresponding PICS parameter estimates. 

3.2 Modeling multiple binding events 

We use mixture models to address the possibility that the sets of forward and reverse reads 
in single candidate region were generated by multiple closely-spaced binding events. We 
model the forward and reverse reads using t-mixture distributions: 



/j ~ ^4 (yU — 5/2, cTj) and ~ ^4 (/i + 5/2, 
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Figure 1: Binding events in two candidate regions in GABP data. PICS detected one binding 
event in the region in (a) and two binding events in the region in (b). Forward and reverse 
strand ahgned reads are shown by red and green arrowheads, respectively. Mappabihty pro- 
files are shown as black/white lines, in which the white intervals show nonmappable regions. 
In (a) the distribution of reverse reads has been biased by a region with low mappability. 

where fifk = /ifc — and firk = yUfc + ^kf^ and /i^, 6k, CTf^, ark are defined as in ([1]), but 
have an index k that corresponds to the binding event fc, while Wk is the mixture weight 
of component k, which represents the relative proportion of reads coming from the binding 
event k. For simplicity we denote by and Qr the resulting p.d.f. of the forward and reverse 
mixture distributions. 

Figure [T]d displays a candidate region that has two binding events, along with the corre- 
sponding PICS parameter estimates. 

As described in (IT]l2]), PICS uses t distributions with 4 degrees of freedom to model local 
distributions of forward and reverse reads. While the t distribution is similar in s hape to the 
Gaussian distribution, its heavier tails make it a robust alternati ve (ILange et al.l . mm. The 



degrees of freedom are fixed as f = 4 to minimize computation (ILo et al.l . I 20081 ). Note also 
that since a DNA fragment should contribute a forward read or a reverse read with equal 
probability, we use the same mixture weight Wk for both forward and reverse distributions. 
Finally, to accomodate possible biases (e.g. in DNA sonication) that result in asymmetric 
forward and reverse peaks, we use different forward and reverse variance parameters and 
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3.3 Modeling multiple binding events with missing reads 

Building on ([I]l2]), we now consider the case where some reads are missing due to one or 
more non-mappable regions intersecting a candidate region. Once again, for illustration, we 
focus on a single candidate region, whose range is denoted by S. For each chromosome, a 
mappability profile for a specific read length consists of a vector of zeros a nd ones that gives 



an es timated read mappability 'score' for each base pair in the chromosome (IRobertson et al. 



20081 ). A score of one at a position means that we should be able to align a read of that 



length uniquely at that position, while a score of zero indicates that no read of that length 
should be uniquely alignable at that position. As noted, reads that cannot be mapped to 
unique genomic locations are typically discarded. For convenience, and because transitions 
between mappable and non-mappable regions are typically much shorter than these regions, 
we compactly summarize each chromosome's mappability profile as a disjoint union of non- 
mappable intervals that specify only zero- valued profile regions (Figured]). 

Let us assume that a candidate region intersects one or more of these intervals. We 
can write S = IJ^o*^'' where 5*; = [ai,bi] denotes the / — th non-mappable interval, with 
I = 1, . . . , L, and 5*0 denotes the union of intervals that have high mappability, and so should 
have no missing reads. In Si, the fn {i = l,...,nfi) and rij (j = l,...,nri) denote nji 
independent forward reads and riri independent reverse reads. Note that only the quantities 
with / = are observed, while all others are unobserved random variables. Also, note that 
Ufo, riro, and L will vary across candidate regions. 

Based on ([2]), fn and r/j, / = 1, . . . , L, follow a truncated t-mixture model, which is given 
hj Qf and Qr truncated on Si. The only information carried in the mappability profile is the 
location and length of 5*;; these affect the estimation of the model parameters shared between 
the observed and unobserved reads, i.e. w, ^t, S, a-f, and cr^. As we will see in Section HI it 
is possible to account for the missing reads when estimating the unknown parameters. 



3.4 Prior distributions 

Typically the library construction process makes prior information available for the length 
of the DNA fragments, 6k- We can use a Bayesian approach to take advantage of this 
information by allowing the 6k s for all binding sites to derive from a common prior fragment 
length distribution. Similarly, we can also put a common prior distribution on ajj^ and a"^/^, 
which allows us to incorporate prior information about the variability of the DNA fragment 
length within a site and to regularize variance estimates when few reads are available. For 
computational convenience, we use a normal inverse gamma conjugate prior, given by 

(^}k, (^rk ~ ^Ga{(y, p) and {6k\(T%, (jlk) ~ N(^, p^V {a^^ + a;^)) (3) 
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where ^ represents our best prior guess about the mean fragment length across all binding 
sites, and p controls the spread around this guess. Similarly, (3 /{a — 1) represents our 
best prior guess about the variance of the DNA fragment length, and — l)^(a; — 2) 

controls the spread around this prior guess. In the analysis reported here, we choose a = 20, 
(3 = 40000, C, = 175, and p = 1. These values were based on knowing that DNA fragments 
should be on the order of 100-250 bps a fter gel size selection for both datasets considered 



(^Valouev et al. 



2008 



Zhang et al. 



and resulted in a fairly non-informative prior for 
the DNA fragment length, with a mean of 175 bps and a standard deviation of approximately 
50 bps. 

3.5 Parameter Estimation Using the EM Algorithm 

Given the conjugacy of the prior chosen, an Expectation-Maximization (EM) algorithm can 
be derived to find the maximum likelihood estimates (MLE) of the unknown parameter 
vector = (^1, . . . , Ok) where Ok = {wk, Pk, ^k, crjj^, , cr^f,). Our algorithm is similar to those 



used i n t mixture models and Ba yesian regularization for mixture models (IDempster et al. 



1977 



Peel and McLachlan 



20001). In the presence of mi ssing reads, we use an algorithm 



similar to that developed by iMcLachlan and JonesI (Il998l ) for grouped and truncated data. 
In the following text, for ease of notation, we use the letter d to denote either / or r. For 
simplicity, we first describe our EM algorithm when no missing reads are present, i.e. for 
S = Sq, dii = di. 

Complete data likelihood: We consider the 'complete data' to be = {d^, ZdijUfi), 
where Zdi and u^i are the missing data. The newly introduced missing data are: first, the 
unobserved cluster memberships, which are defined as Zdi = {zdn, . . . yZdix) for the reads, 
where z^ik is a binary indicator that the read di belongs to mixture component k; and 
second, the weights Udi = {udn, ■ ■ ■ , Udix), which come from the normal-gamma compound 
parameterization, and are defined by 



{di\Udik — Udik,Zdik — i,fJ'k,Sk] 
{Udik\Zdik = 1) 



N 



dk 



Udik 

gaiv/2,v/2) 



(4) 
(5) 



independently for i = 1, . . . , n^j, where f = 4 is the degrees of freedom of the t distribution. 
The advantage of writing the model in this way is that, conditional upon the it^j's, the 
sampling errors are again normal but with different precisions, and estimation becomes a 
weighted least squares problem. 

The penalized log complete data likelihood, denoted /*, is given by l*{@\y) = l{@\y) + 
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^prior' where l{@\y) is the complete-data log-hkehhood, given by 

my) 

1. J <■ I— J 2 \ 

WfcN Idilfidk,^] Qa{udik%2) 



rid G 

Yl 5Z 5Z ^rfife <) log 

de{/,r} i=l k=l 
na G 



> , > , Zdik < log Wk - log adk - log V 27r — I I + log Udik - 'iudik + log 



dG{/,r} i=l k=l 

and /prior' prior 'penalty' on (5, cr'j, cr,^), is given as 



I 



prior 



0". 



2a -1 



k k 

E-Step: Given the current estimate 0~ for 0, the conditional expectation of the penalized 
log complete data likelihood is given as 

d 



Qi&\&-) = E[/(e|i/)|e-] + /p,io, 

rid K 
dG{/,r} 4=1 A.-1 



logWfc - logO-rffc - 



Udik I di — Hdk 



O'dk 



+ A 



(7) 



where A is a constant wit h respect to the parameter vector 0. Given this, the E-step 
(jPeel and McLachlaru . |2000| ) consists of computing the following quantities 

d , U)kt4,{di\fidk,0'dk) /„N 



Zdik 



^{Zdki\ydi^®' 



Udik = ^{UdiklVdiyZd: 



J2k U!kt4.{di\^dk, O'dk) 

1,0-) ^ 



^ + {di- fidkr/cT'dk 

M-Step: During the M-step, the goal is to maximize Q(0|0~) with respect to 0, which 
requires solving dQ{@\@^)/d@ = 0. 

Unfortunately, there is no simple closed form solution for 0. Given this, we adopted 
a conditional approach in which we first maximize over (w, ^i,<5), conditional on (crj,<Tr), 
and then maximize over (cr/, cTj,), conditional on the previously up dated (w, p,,d), resultin g 
in an Expectation/ Conditional Maximization (ECM) algorithm (IMeng and Rubinl . |2008| ). 
Conditional on afk and ark, we solve a linear system analytically, which leads to the following 
estimates: 

Xfk + Xrk 



Wk 
P'k 



Sfk + Srk 

m 



+ . J k, 



fk + rrirk 2{mfk+Thrk) 
Srkrh;^ - Sfkm-l + p{a^^ + a;^)i{m-l + m^^) 



^+P{^fk +^rk)i^fk 



fk 
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where 

nd rid 

Xdk — ^ Zdik, Sdk — ^ diZdikUdik, T^dk — ^ ^dik^dik- 
i=\ 1=1 i=l 

Conditional on these new estimates Wk, fik, Sk, we can then solve a non- linear system analyt- 
ically. The new estimate of a^^ is the only non- negative root, and is given as 

^dk (^3d - Ci) /C4d, 

where 

rid 



Vdk = 


1=1 




C2d — 






Czf = 


{fif-rir) (2q;-1 + x/) + C2/ 


iXf + Xr) , 


Csr = 


{Vr - Vf) {2a -1+ Xr) + C2r 


iXr + Xf) , 


Cid = 


2C2d (fjf-fjr), 




Ci - 


J[{2a-l){fjf-fjr)+C2fXr 


- C2rXff + 4C2^X/C2/Xr 



Accounting for missing reads: In the presence of missing reads, wc decompose the log 
complete data likelihood as l{@\y) 
likelihood in partition 5";, given by 



complete data likelihood as l{@\y) = '^iLohi®\y), where li{@\y) is the complete-data log- 



ridi G f f ^ _ 

= ^diik { log Wk - log Gdk - log \/27r ^ I — — ) log Udlik - 2udiik + log 

dG{/,r} «=1 k=l k 

We now have additional missing data, ridi and du, corresponding to the number of missing 
reads and the missing reads themselves. To accommodate this, all that we need to change 
is to add two steps to our E-step, as follows. 

Because the unknown counts Udu I = 1, . . . , L, follow a negative multinomial distribution, 
we simply replace them with their conditional expectations, which are given by 

ndl=E{ndl\ydo^,&') = ndoPdi{&-)/Pdo{&-), (10) 

where Pdii@-)^PT{X e Si) = ^ yd(x|0-)dx and Pdo(e-)^ Pr(X e So) = 1-Ef=i 
are the probability measures of the partitions Si and Sq. 
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Second, conditional on the imputed counts n^;, we replace the following quantities with 
the corresponding expectations 



Xdk Xdok + y^^ndi^di[zdik] 
1=1 

L 



Sdk ^ SdOk + y^^ridiE.dilzdikUdik], 
1=1 

L 

<TT'dk 'fndOk + ''^^ndl^dl[dlZdlkUdlk], 

1=1 
L 

Vdk ^ Vdok + ^ndiEdi[{di - fikfzdikUdik] 



1=1 



where Xdok, Sdok, "^dofc, and fjdok are the original quantities as defined in M-step in the case 
of no missing reads, and E^/ are the expectations with respect to the unobserved reads {du, 
I > 0), conditional on observed reads d^i and on previous estimated parameters 0~ (the 
Appendix gives more details of computing these expectations). 

3.6 Inference and Detection of Binding Sites 

Choosing the number of binding events in each region: The EM algorithm described 
above assumes that the number of binding events within a region, K, is known. However, 
in practice, K is unknown and needs to be estimated. For each candidate region, we fit our 
PICS model w i th taking values from 1 to 15, and select the value of K that has the largest 



BIC (ISchwarzl . 



19781 ). which in our case is given by 

BIC = -2Q{& = &\&) + {5K - 1) ln(n/o + ri.o), (H) 
where is the final estimate for the parameters 0. 

Uncertainty of parameter estimates: It is useful to extend the point estimates for 
the parameters of interest, /x and 6, by deriving measures of uncertainty for them. Within 
our framework of mixture models with truncated data, we derive an approxir nation of the ob- 
served information matrix for the parameters using the approach described in 



McLachlan and Krishnan 



( 119971 ). Using the observed information matrix, we can then obtain approximate standard 
errors for both p, and S. We can use these standard errors to, for example, define the starts 
and ends of binding event neighborhoods, filter out noisy enriched regions and estimate 
confidence intervals for binding site point locations. 

Binding event neighborhoods: Because PICS models local concentrations of bidirec- 
tional reads, we can define 'high confidence' neighborhoods whose extents are given by the 
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maxima of forward and reverse density distributions. Using our PICS parameters, and tak- 
ing into consideration the standard errors of the estimates, for a given binding event this 
neighborhood is defined as the interval /i ± 5/2, extended by three standard errors on each 
side, i.e. (SE(/i — 5/2) for the left limit and SE(/i + 5/2) for the right limit). These high 
confidence neigh borhoods can defin e 'enriched' regions in a file that can be visualized in a 
genome browser Jxuhn et al l, bood l 



Peak merging and filtering: We use BIG to estimate the number of binding events within 
each candidate region. While BIG is well suited to selecting the number of mixture compo- 



an underlying i 


3roba ; 


(Baudrv et al. 


2008) 



tains hundreds of reads, BIG may select a model that has too many components in obtaining 
a good fit to the underlying density. To address this, we merge peaks that have overlapping 
binding events, as defined by the start and end positions defined above. The parameters of 
the merged peaks are obtained by moment matching conditions (see appendix). Since the 
combined parameters ^t and 5 are linear combination of the original ones, the original infor- 
mation matrix can be used to recompute the standard errors. For the GABP and FOXAl 
data described below, this approach merged less than 1% of the binding events. 

In addition to merging overlapping events, we also filter out binding events that have noisy 
or atypical parameter estimates, which could potentially affect the downstream analysis. 
Specifically, we remove binding events that fail to satisfy any of the following three criteria: 

(i) SE(/i) < 50; {ii) 50 < 5 < 200; (m) a/,a^ < 150. 

Essentially, (z) filters events that have noisy binding site position estimates, {ii) filters events 
with atypical average DNA fragment le ngth estimates (e.g. e vents that have high fractional 



20081 )). and {iii) filters events with 



overlaps with simple tandem repeats (jJohnson et al 
large DNA fragment length variability. 

Scoring and ranking binding events: In order to identify and rank a statistically 
meaningful subset of binding events, we define an enrichment score for each binding event. 
For a given event, we define Fcmp and Rchip, the number of observed forward/reverse 
GhIP ('treatment') reads that fall within the 90% contours of the forward/reverse distri- 
butions, i.e. within fi^ =t 2.13(7^. We assign an enrichment score to each binding event as 
s = mm{Fchip, Rchip)- When a control sample is available, we similarly define Fcontroi and 
Rcontroi, by computing the number of observed forward/reverse reads in the control sample 
that fall within the 90% contour of the forward/reverse distributions estimated from the 
GhIP sample. Using this information, we define an enrichment score for the treatment rel- 
ative to the control as s = {N control /Nchip) ■ '^v^{d=F,R}{{dchip + l)/((4ontroZ + 1))), where 
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the addition of the constant one prevents a division by zero. The scahng of the enrichment 
score by Ncontroi/Nchip accounts for the control and ChIP samples having different numbers 
of reads. 

False discovery rate: Given control data, we can estimate the false discovery rate as a 
function of the enrichment score. We do this by simply repeating the analysis after swapping 
the control sample for the ChIP sample and recomputing our enrichment scores, which we 
call 'null' enrichment scores and denote by Sq. Then the FDR, as a function of the threshold 
value q, can be computed as follows: 

{#s : s > g} 

4 Application to experimental datasets 

We applied PICS to the two experimental data sets described in section [21 obtaining 58,622 
candidate regions and 60,087 binding events for GABP data, and 32,287 candidate regions 
and 32,418 binding events for FOXAl data. Table [T] summarizes the number of binding 
events, broken down by the number of mixture components detected in the corresponding 
candidate region. Most of the candidate regions were estimated to contain a single binding 
event, but a non negligible number may contain more than one event. For example, for 
GABP, 2274 of the binding events that PICS detected were in two-event candidate regions. 
The table also suggests that PICS' mixture model was effective in discriminating closely- 
spaced binding events, as, for example, for the top-ranked 5000 GABP events, 79 percent 
of events in two-component regions were associated with a predicted GABP motif site (see 
below for details about motif sites). 

Figure [2] shows histograms of estimated average DNA fragment lengths 6 for the top- 
ranked 10000 filtered and unfiltered enriched regions. We considered only this subset, be- 
cause, based on the estimated FDR (Figure [3]), the other regions are likely to be false posi- 
tives. For the F OXAl data the est imated average fragment size was approximately 150 bps, 
consistent with 



Zhang et al 



(I2OO8I ): it was somewhat smaller for the GABP data. Figure [2] 
also shows that most of the regions had DNA fragments between 50 and 200 bps, which 
supports our filtering atypical regions by this parameter. 

We now compare the performance of PICS and the QuEST, MACS and CisGenome 
analysis methods, using the FOXAl data and GABP data. Figure E] shows the relationships 
between the region rank and FDR for the top-ranked 5000 regions for each method. As 
expected, the top-ranked regions for all methods had FDRs whose values were very small or 
zero. While CisGenome was consistent in returning the largest number of low-FDR regions 
for both datasets, the responses of the other three methods differed for GABP and FOXAl 
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Figure 2: Histogram of estimated average DNA fragment lengths, 5, in GABP (a) and 
FOXAl (b) data, before and after filtering. For clarity, only results for the top 10000 regions 
are shown. 
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Table 1: Number of PICS binding events found for the GABP and FOXAl data, broken 
down by the number of mixture components detected in the corresponding candidate region. 
The first two rows give, for the 5000 most significant binding events for each data set, the 
number of events identified in regions that had 1, 2 or 3+ mixture components, and, for 
each of these classes of events, the percentage of events that was associated with at least 
one predicted site motif site. For example, in the 5000 top-ranked GABP regions, of the 903 
binding events in two-component regions, 79 percent could be associated with a predicted 
GABP site. 





GABP 




FOXAl 




7^ of components in region 


1 2 


3+ 


1 2 


3+ 


# of events (top 5000 regions) 


3829 903 


64 


4913 74 


3 


% of motifs (top 5000 regions) 


77 79 


73 


81 75 


67 


# of events (all regions) 


56,229 2274 


119 


32,012 266 


9 



data. QuEST's response was markedly different for the two sets of data, being close to 
CisGenome's for GABP, but having the smallest number of low-FDR regions for FOXAl. 
MACS' response was similar to those of QuEST and CisGenome for the first 4000 GABP 
regions, after which its response was approximately parallel to that of PICS. For FOXAl, 
the MACS curve diverged progressively from CisGenome's after approximately 2500 regions, 
then changed slope abruptly at approximately 4500 regions and crossed PICS' curve. PICS 
returned by far the fewest low-FDR regions for GABP data, but its response to FOXAl 
data was intermediate between that of QuEST and MACS for ranks between 2000 and 
approximately 4500. 

Noting that the algorithms could respond very differently to different data sets in terms 
of FDR, we then compared the four methods by identifying conserved DNA sequence motifs 
in the 5000 top-ranked predictions from each method, using 200-bp wide regions that were 
centered on each rnethod's binding site estimates ('peak summits'). For motif analysis we 
used GADEM (ILil . |2009| ). which can process large sets of ChlP-seq regions on a single CPU, 
identifies multiple motifs and adjusts motif widths, and performs well relative to algorithms 
that are more compu tationally demanding. We assessed the de novo motifs using STAMP 
(iMahony et al.l . 120071 ). and retained only 'expected' and biologically relevant motifs. As 
expected, for all four methods, GADEM identified GABP and Forkhead motifs as the dom- 
inant motifs in GABP and FOXAl datasets respectively. For the FOXAl data, regions for 
all methods also contained the binding motif for the FOS proto-oncogene protein. The FOS 
gene family encodes leucine zip per proteins tha t can d imerize with proteins of the JUN fam- 
ily to form the AP-1 complex (IMilde-Langoschl . |2008| ). The AP-1 complex is over-expressed 
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Figure 3: Number of detected peaks at different False Discovery Rate levels for the four 
analysis methods, for GABP data (a) and FOXAl data (b). 
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s (e. g . MCF7) and can inte ract directly with the ER transcription factor 



i n ER positive ee l 
( iMilde-Langosch 

play an important role in ER regulation and to interact with ER (lEeckhoute et al. 



2008 



Cicatiello et al. 



20041 ) ■ Similarly, the FOX Al protein is known to 



2006 



Lupien et all l2008l ). The FOS motif that we identified was consistent with AP-1 enriched 



motifs reported for ChlP-chip FOXAl regions iLupien et al.l (120081 ) and may reflect interac- 
tions, possibly indirect, between the FOS and FOXAl proteins. All other motifs identified 
by GADEM appeared to be due to repetitive elements. For the work described here, we 
used GABP motif occurrences for evaluating GABP results, and both FOX and FOS motif 
occurrences for evaluating FOXAl results. 

We evaluated the four methods using two criteria: 1) the motif occurrence rate, i.e. the 
fraction of enriched regions that contained a biologically 'expected' motif, for which a larger 
value indicates better performance; and 2) the spatial accuracy, i.e. the distance between a 
binding site point estimate and a motif occurrence, for which a smaller value indicates better 
performance. Because a motif can occur more than once in a sequence, we used only the 
motif instance closest to the predicted binding event (peak summit) when computing the 
spatial accuracy. 

Figures |l^,b show the motif occurrence rate and spatial accuracy as a function of the 
region rank, for each methods' top-ranked 5000 enriched GABP regions. PICS had the 
highest motif occurrence rate for ranks above approximately 3500, below which PICS' and 
MACS' rates appeared comparable. MACS' rates were intermediate for ranks between 1000 
and 3800, but below QuEST's rate for ranks above 1000. Rates for QuEST and CisGenome 
were lower, and were comparable for ranks below 2000. PICS and MACS had the best 
spatial accuracy, with PICS more accurate for ranks above 2000, followed by QuEST and 
CisGenome. 

Figures [5K,b show motif occurrence rate and spatial accuracy for the FOXAl data. Con- 
sidering both metrics over the full range of the top 5000 regions, the relative performance of 
the four methods was generally similar to that for GABP data: PICS, followed by MACS, 
QuEST and then CisGenome. 

Because cells can use multiple closely-spaced transcription factor binding sites to establish 
progressive expression responses to cellular signals, we assessed how effectively PICS' mixture 
model can detect closely adjacent binding sites. Using our predicted transcription factor 
binding motifs for the top-ranked 5000 PICS predictions for GABP and FOXAl data, we 
determined the percentage of binding events from single- and multiple-component candidate 
regions that could be associated with at least one motif site. Table [1] shows these results as 
a function of the number of mixture components in a region. Far more GABP regions than 
FOXAl regions had two components (903 vs. 74) or at least three components (64 vs. 3). 
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Figure 4: Motif occurrence rate and spatial accuracy for GABP data, as a function of 
enriched region rank, for the 5000 top-ranked regions for each method. 
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Figure 5: Motif occurrence rate and spatial accuracy for FOXAl data, as a function of 
enriched region rank, for the 5000 top-ranked regions for each method. 



19 



Table 2: Number of proximal binding events found by in the 5000 top-ranked regions iden- 
tified by each method in GABP and FOXAl data, as a function of the motif 'proximity' 
distance d. The numbers in paratheses give the percentage of binding events that could be 
associated with at least one predicted motif site. For example, the first row (rf = 250) gives 
the number and percentage of events that had at least one other binding event within 250 
hp. 



d 




GABP 






FOXAl 




PICS 


QuEST MACS 


CisGenome 


PICS 


QuEST MACS 


CisGenome 


250 


188(73) 


405(63) 





6(83) 


269(67) 





500 


376(71) 


950(63) 





26(70) 


361(68) 





1000 


478(70) 


1074(63) 


128(64) 


75(78) 


443(66) 






For both data sets, the percentage of binding events that was associated with a predicted 
binding motif was relatively insensitive to the number of mixture components in a region. 
These results suggest that our mixture model was effective in distinguishing biologically 
meaningful proximal binding events. 

To assess the ability of the other methods to detect proximal binding events we generated 
a similar table, but this time considered binding events that had at least one other event 
within a fixed distance d. Table [2] summarizes the results for d = 250, 500 and 1000 bps. 
For these data, PICS and QuEST were the most effective at identifying proximal binding 
events, and a large fraction of these events was associated with a predicted motif site. While 
QuEST predicted the largest number of proximal binding sites, a larger fraction of the 
mixture components reported by PICS were associated with predicted binding motifs. For 
these data, MACS and CisGenome were less effective at discriminating closely spaced binding 
events. 

As described in section HJ PICS can compute approximate standard errors for its model 
parameter estimates. In particular, we can derive an approximate confidence interval for a 
given predicted binding event location as /t ± c ■ SE(/i), where c is a constant to be chosen 
as a function of the coverage desired. Assuming that fi is approximately normal, c = 1.96 
should give us an approximate 95% confidence interval for our binding site position. 

Using the set of motifs identified by GADEM, we evaluated the actual coverage of our 
confidence intervals for different values of c. Figure E] shows the occurrence frequency of 
GABP motifs (left) and FOXAl motifs (right) within c ■ SE(/i) of peaks centers. Using 3 
standard errors, the coverage was approximately 65% and 80% for the GABP and FOXAl 
data. While these numbers suggest that the current version of PICS provides a capable 
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(a) 



(b) 




modeling framework, they also suggest that there are significant opportunities to address 
noise and biases in more depth in order to improve spatial accuracy. 

Finally, we evaluated the effect of the mappability profiles on the parameter estimates. 
We re-did the analysis while ignoring mappability, and compared the spatial accuracy, i.e. 
the distance to the closest computationally verified binding site, with and without the map- 
pability correction. Figure [7] shows boxplots of the difference between corrected and uncor- 
rected estimates for various percentage of missing reads. The bloxplots are skewed to right, 
which shows that the correction improved the estimates for binding event locations, and the 
degree of improvement increased with the fraction of missing reads. 



5 Discussion 

We have developed PICS, a probabilistic framework for detecting transcription factor bind- 
ing events from ChlP-seq experiments. The approach integrates a number of important 
factors in interpreting aligned read data, including correcting for reads that are missing 
due to genome repetitiveness and using prior information on input DNA fragment lengths. 
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Figure 7: Using mappability improved spatial accuracy for (a) GABP and (b) FOXAl data. 
The Y-axis shows how correcting for missing reads in predicting a binding site changed the 
distance between the site and the predicted binding motif closest to it. A positive (negative) 
value indicates that using the mappability correction decreased (increased) the distance 
between a site and its closest motif. For each data set, six relative levels of correction are 
shown, e.g. for 200 GABP binding event regions, the final number of estimated reads in 
each region included at least 20% of missing reads. 



22 



Working with two published ChlP-seq data sets from human cell lines, we compared PICS 
to three alternative analysis m ethods. While additional methods are available for detecting 
bound regfons 



Nix et al 



2008 



rom ChlP-seg (jFeies et al. 



Rozowsky et al 



2008 



Jothi et al. 



2008 



Kharchenko et al. 



2008 



2009f), the three methods we used have been shown to have 



good performance, and so offer reasonable performance baselines. The results of the compar- 
ison showed that, although the FDR-rank relationships returned differed by method and data 
set, the binding events predicted by PICS were the most consistent with computationally 
identified motif sites in both data sets. 

We showed that PICS' mixture model addresses multiple adjacent enrichment events, and 
can fit a different DNA fragment length value for each binding event in a mixture. While 
we allowed the mixture model to detect up to 15 components per candidate region, we can 
readily adjust this limit. Datasets can be expected to contain regions in which adjacent 
binding sites are too close to be resolved, but, given a DNA fragment length distribution, 
we anticipate that PICS should discriminate most adjacent sites that are resolvable. 

We note that, because it is based on mixture models and accounts for missing reads, PICS 
is computationally intensive. The results shown were o btained with an imple i nenta tion of 
PICS that was written in the R programming language (jihaka and Gentleman! . Il996l ). Pro- 
cessing a lOM read data set required an average computing time of three 3GHz CPU-hours 
per chromosome. While we reduced the overall computation time by treating chromosomes 
in parallel on a multiprocessor machine, and could also use a compute cluster, we are also 
re-implementing PICS in C. We anticipate that this new version will reduce the computing 
time by at least a factor of ten a nd will scale well with la rger datasets. PICS will be made 
freely available via Bioconductor (IGentleman et al.l . |2004| ). 

At the time of writing, all published short read ChlP-seq data are for single end (SE) 
reads, rather than for paired-end (PE) reads. PE data offer more direct information on 
DNA fragment lengths, should resolve a subset of read alignments that would be non-unique 
in SE data, and, in principle, could give direct information abo ut long range chromosome 
interactions and genome rearrangements (IHolt and Jonesl . 120081 ). However, because a PE 
experiment requires more input DNA and is more costly than an SE experiment, it is likely 
that PE and SE data will be appropriate for somewhat different applications. We anticipate 
that PICS will be useful in work to identify optimal applications for each approach, and that 
its probabilistic approach will remain useful for PE data, where having defined fragment 
lengths should simplify the modeling framework. 

As a first step in implementing a probabilistic approach for ChlP-seq data, we have shown 
how to incorporate prior information about the DNA fragment lengths using a Bayesian ap- 
proach. We can extend the PICS framework to incorporate more types of prior information. 
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For example, we could place a prior distribution on the binding site position, and could 
include in this information about nucleosome occupancy and computationally derived motifs. 
Such extensions should allow us to further improve the detection of biologically relevant bind- 
ing sites. With such extensions, we anticipate that probabilistic methods may help ChlP-seq 
contribute to biological research by offering principled ways for addressing backgrounds and 
diverse types of noise, and for integrating diverse types of biological information. 
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Computational details for the missing read case: We calculate expectations Mai with 
respect to the double truncated i-mixture density of unobserved reads as follows: 



^dl[Zdlk] = WkPdi\Q~)H3^dlk 

^di[zdikUdik\ = WkP^i^{Q~)Ho^dik 

^di[diZdikUdik\ = WkPai^{Q'')[2a^Hi^dik + fJ-kHo^ik] 

^di[{di - fikfzdikUdik] = '^kPdi^{&~)mcr^fH2,dik + il^k - P-k^Ho^dik + Ki^k - McTk Hi,dik\ 



The quantities i?'s can be calculated as: 

'bi- 



Hj,dlk 



— I^dk ^ _ ^ ^ — Hdk 
2(Tdk 2(7, 



Tdk ^Cfdk 

H^^dik — Ti{hi\iidk,crdk) —Ti{ai\ndk,CFdk) 

for j = 0,1,2, where T4 refers to the c.d.f. of t distribution with 4 degrees of freedom, 
B = r(3.5)/(r(3)v^) is a constant, and the functions /ij's are defined as: 

hr{x) = B (^Mx)f - ^Mx)f 



h2{x) = 

/i4(x) = 



-B 



2\-2-5 



sin(arctan(x')) 



Pcirameter recalculation when merging binding events: The parameters of merged 
binding events are calculated by solving these moment matching equations: 
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